ORC: Fix importing ORC files with float and double columns and test#3332
Merged
rdblue merged 14 commits intoapache:masterfrom Oct 29, 2021
Merged
Conversation
kbendick
commented
Oct 20, 2021
kbendick
commented
Oct 20, 2021
| if (type.typeId() == Type.TypeID.FLOAT) { | ||
| max = ((Double) max).floatValue(); | ||
| } | ||
| } |
Contributor
Author
There was a problem hiding this comment.
I tested without this fix and encountered the error, so I'm sure that the test does in fact hit the problem.
rdblue
reviewed
Oct 20, 2021
...ark3-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java
Outdated
Show resolved
Hide resolved
rdblue
reviewed
Oct 20, 2021
...ark3-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAddFilesProcedure.java
Show resolved
Hide resolved
yyanyy
reviewed
Oct 21, 2021
rdblue
approved these changes
Oct 26, 2021
Contributor
|
This looks good to me. To unblock the 0.12.1 release, we should get this in and fix the minor issue with the tests later. |
bc41476 to
8e04370
Compare
Contributor
Author
kbendick
commented
Oct 27, 2021
rdblue
reviewed
Oct 27, 2021
rdblue
reviewed
Oct 27, 2021
rdblue
reviewed
Oct 29, 2021
| // To avoid storing NaN in the Iceberg metrics, NaN is normalized to +/- Infinity for max / min respectively. | ||
| private static Object normalizeFloatingPointColumnsIfNeeded(Bound bound, Type type, double value) { | ||
| if (type.typeId() == Type.TypeID.DOUBLE) { | ||
| return Double.isNaN(value) ? (bound == Bound.UPPER ? Double.POSITIVE_INFINITY : Double.NEGATIVE_INFINITY) : value; |
Contributor
There was a problem hiding this comment.
What about a simpler method, replaceNaN(double value, double replacement)? Then you just need the ternary check here and you can customize with Double.NEGATIVE_INFINITY in the call rather than passing a Bound.
Also, you only need one since the bound is always a double to begin with. So you can replace the logic with:
max = replaceNaN(((DoubleColumnStatistics) columnStats).getMaximum(), Double.POSITIVE_INFINITY);
if (type.typeId() == Type.TypeID.FLOAT) {
max = ((Double) value).floatValue;
}
Contributor
Author
There was a problem hiding this comment.
Ah that's a good idea. Wasn't a fan of the double ternary.
Contributor
Author
|
Ah that's good idea. Wasn't a big fan of the multiple ternaries.
…On Fri, Oct 29, 2021 at 10:20 AM Ryan Blue ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java
<#3332 (comment)>:
> @@ -262,6 +269,16 @@ private static Metrics buildOrcMetrics(final long numOfRows, final TypeDescripti
return Optional.ofNullable(Conversions.toByteBuffer(type, truncateIfNeeded(Bound.UPPER, type, max, metricsMode)));
}
+ // ORC uses NaN in its metrics for floating point numbers (float and double).
+ // To avoid storing NaN in the Iceberg metrics, NaN is normalized to +/- Infinity for max / min respectively.
+ private static Object normalizeFloatingPointColumnsIfNeeded(Bound bound, Type type, double value) {
+ if (type.typeId() == Type.TypeID.DOUBLE) {
+ return Double.isNaN(value) ? (bound == Bound.UPPER ? Double.POSITIVE_INFINITY : Double.NEGATIVE_INFINITY) : value;
What about a simpler method, replaceNaN(double value, double replacement)?
Then you just need the ternary check here and you can customize with
Double.NEGATIVE_INFINITY in the call rather than passing a Bound.
Also, you only need one since the bound is always a double to begin with.
So you can replace the logic with:
max = replaceNaN(((DoubleColumnStatistics) columnStats).getMaximum(), Double.POSITIVE_INFINITY);
if (type.typeId() == Type.TypeID.FLOAT) {
max = ((Double) value).floatValue;
}
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#3332 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACLAXEQNIVK7XX2UOEYFGFLUJLJUDANCNFSM5GMSUBDA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
rdblue
approved these changes
Oct 29, 2021
Contributor
|
Thanks, @kbendick! |
kbendick
added a commit
to kbendick/iceberg
that referenced
this pull request
Nov 1, 2021
rdblue
pushed a commit
that referenced
this pull request
Nov 1, 2021
izchen
pushed a commit
to izchen/iceberg
that referenced
this pull request
Dec 7, 2021
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The OrcMetrics code assumed that Iceberg metrics would be available, but that isn't the case when importing existing ORC files.
This adds tests for the changes proposed in #3320 to support importing ORC files with floats and doubles in them due to this metrics behavior.
This also adds the same fix for the metrics situation for
max.I had to update the Spark3 Extensions gradle build file. But given that iceberg-data and iceberg-orc are already added as
implementationin the top level spark project, I'm wondering if this is ok? Otherwise I can find a different route to test it.